Automatic Decomposition of Multi-Author Documents Using Grammar Analysis
نویسندگان
چکیده
The task of text segmentation is to automatically split a text document into individual subparts, which differ according to specific measures. In this paper, an approach is presented that attempts to separate text sections of a collaboratively written document based on the grammar syntax of authors. The main idea is thereby to quantify differences of the grammatical writing style of authors and to use this information to build paragraph clusters, whereby each cluster is assigned to a different author. In order to analyze the style of a writer, text is split into single sentences, and for each sentence a full parse tree is calculated. Using the latter, a profile is computed subsequently that represents the main characteristics for each paragraph. Finally, the profiles serve as input for common clustering algorithms. An extensive evaluation using different English data sets reveals promising results, whereby a supplementary analysis indicates that in general common classification algorithms perform better than clustering approaches.
منابع مشابه
A Multi-Document Multi-Lingual Automatic Summarization System
Abstract. In this paper, a new multidocument multi-lingual text summarization technique, based on singular value decomposition and hierarchical clustering, is proposed. The proposed approach relies on only two resources for any language: a word segmentation system and a dictionary of words along with their document frequencies. The summarizer initially takes a collection of related documents, a...
متن کاملA Fault Diagnosis Method for Automaton based on Morphological Component Analysis and Ensemble Empirical Mode Decomposition
In the fault diagnosis of automaton, the vibration signal presents non-stationary and non-periodic, which make it difficult to extract the fault features. To solve this problem, an automaton fault diagnosis method based on morphological component analysis (MCA) and ensemble empirical mode decomposition (EEMD) was proposed. Based on the advantages of the morphological component analysis method i...
متن کاملWavelet Packet Based Texture Features for Automatic Script Identification
In a multi script environment, an archive of documents printed in different scripts is in practice. For automatic processing of such documents through Optical Character Recognition (OCR), it is necessary to identify the script type of the document. In this paper, a novel texture-based approach is presented to identify the script type of the collection of documents printed in ten Indian scripts ...
متن کاملGlobal Approach for Script Identification using Wavelet Packet Based Features
In a multi script environment, an archive of documents having the text regions printed in different scripts is in practice. For automatic processing of such documents through Optical Character Recognition (OCR), it is necessary to identify different script regions of the document. In this paper, a novel texture-based approach is presented to identify the script type of the collection of documen...
متن کاملCreating Algorithmic Symbols to Enhance Learning English Grammar
This paper introduces a set of English grammar symbols that the author has developed to enhance students’ understanding and consequently, application of the English grammar rules. A pretest-posttest control-group design was carried out in which the samples were students in two girls’ senior high schools (N=135, P ≤ 0.05) divided into two groups: the Treatment which received gramm...
متن کامل